Adaptive Approximate Record Matching
Abstract
Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of approximate record matching is to find the multiple records that belong to one entity. This paper proposes an approximate record matching algorithm that adapts automatically to input error patterns. In the field matching phase, the edit distance method is used, customized for Persian-language problems such as the similarity of Persian characters and common typographical errors in Persian. In the record matching phase, the importance of each field is set by a coefficient assigned to that field. Because typographical error patterns change, each coefficient must change dynamically; for this reason, a genetic algorithm (GA) is used for supervised learning of the coefficient values. The simulation results show that this algorithm performs well compared with other methods (such as decision trees).
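The two phases described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the similar-character pairs, the 0.5 substitution cost, the normalization by the longer string, and every function name are hypothetical, and the coefficients are passed in by hand rather than learned by a genetic algorithm.

```python
from typing import Dict

# Hypothetical pairs of easily confused characters (assumption: the paper's
# actual Persian similarity table is not given in the abstract).
SIMILAR_PAIRS = {
    frozenset(("\u06a9", "\u0643")),  # Persian kaf vs. Arabic kaf
    frozenset(("\u06cc", "\u064a")),  # Persian yeh vs. Arabic yeh
}


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing a with b; similar characters are cheaper (assumed 0.5)."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in SIMILAR_PAIRS:
        return 0.5
    return 1.0


def weighted_edit_distance(s: str, t: str) -> float:
    """Levenshtein distance with a custom substitution cost."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [float(i)]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                            # delete cs
                curr[j - 1] + 1.0,                        # insert ct
                prev[j - 1] + substitution_cost(cs, ct),  # substitute
            ))
        prev = curr
    return prev[-1]


def field_similarity(s: str, t: str) -> float:
    """Map the distance to [0, 1]; 1.0 means the fields are identical."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_edit_distance(s, t) / longest


def record_similarity(r1: Dict[str, str], r2: Dict[str, str],
                      coefficients: Dict[str, float]) -> float:
    """Coefficient-weighted average of the per-field similarities."""
    total = sum(coefficients.values())
    weighted = sum(w * field_similarity(r1[f], r2[f])
                   for f, w in coefficients.items())
    return weighted / total
```

Two records would then be declared a match when `record_similarity` exceeds some threshold. In the approach the abstract describes, the coefficient values are the quantities that supervised learning with a GA would tune; that step is not shown here.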
Similar resources
Automating the approximate record-matching process
Data quality has many dimensions, one of which is accuracy. Accuracy is usually compromised by errors accidentally or intentionally introduced in a database system. These errors result in inconsistent, incomplete, or erroneous data elements. For example, a small variation in the representation of a data object produces a unique instantiation of the object being represented. In order to improve ...
Random databases with approximate record matching
In many database applications in telecommunication, environmental and health sciences, bioinformatics, physics, and econometrics, real-world data are uncertain and subject to errors. These data are processed, transmitted and stored in large databases. We consider stochastic modelling for databases with uncertain data and for some basic database operations (for example, join, selection) with e...
CLUEMAKER: A LANGUAGE FOR APPROXIMATE RECORD MATCHING (Practice-Oriented)
We introduce ClueMaker, the first language designed specifically for approximate record matching. Clues written in ClueMaker predict whether two records denote the same thing based on the values of the records’ attributes. For example, a clue may predict match if the records have identical values for the first name attribute. The values of the clues can then be used as input to a matching algor...
CLUEMAKER: A LANGUAGE FOR APPROXIMATE RECORD MATCHING (Complete Paper)
We introduce ClueMaker, the first language designed specifically for approximate record matching. Clues written in ClueMaker predict whether two records denote the same thing based on the values of the records’ attributes. For example, a clue may predict match if the records have identical values for the first name attribute. The values of the clues can then be used as input to a machine-learni...
Record Matching in Digital
When data stores grow large, data quality, cleaning, and integrity become issues. The commercial sector spends a massive amount of time and energy canonicalizing customer and product records as their lists of products and consumers expand. An Accenture study in 2006 found that a high-tech equipment manufacturer saved $6 million per year by removing redundant customer records used in customer ma...
Journal title:
International Journal of Smart Electrical Engineering
Publisher: Islamic Azad University, Central Tehran Branch
ISSN 2251-9246
Volume 03
Issue 01, 2014